De novo genome assembly depicts the immune genomic characteristics of cattle

Immunogenomic loci remain poorly understood because of their genetic complexity and size. Here, we report the de novo assembly of a cattle genome and provide a detailed annotation of the immunogenomic loci. The assembled genome contains 143 contigs (N50 ~ 74.0 Mb). In contrast to the current reference genome (ARS-UCD1.2), 156 gaps are closed and 467 scaffolds are located in our assembly. Importantly, the immunogenomic regions, including three immunoglobulin (IG) loci, four T-cell receptor (TR) loci, and the major histocompatibility complex (MHC) locus, are seamlessly assembled and precisely annotated. With the characterization of 258 IG genes and 657 TR genes distributed across seven genomic loci, we present a detailed depiction of immune gene diversity in cattle. Moreover, the MHC gene structures are integrally revealed with properly phased haplotypes. Together, our work describes a more complete cattle genome, and provides a comprehensive view of its complex immune-genome.

1. This manuscript attempts to describe both the genome assembly and the immunogenetic loci, but the former is not accomplished to any great extent. I agree that, as a theme for discussing the contiguity of a genome, this is a useful benchmark. However, it is clear that the authors are not experts. Almost all the contemporary literature on MHC, TCR and BCR is ignored, much of which is high quality and actually confirms the structure of the genome presented. The current ARS-UCD1.2 assembly is not even referenced, and complete haplotypes for many of the above regions are already published.
2. The importance of immunogenetic loci, particularly germline-encoded ones, lies in the polymorphism of each allele, alongside the structure. For example, MHC haplotypes are defined by both in tandem. Therefore, the level of sequence detail and comparison across all the loci described precludes publication if the focus is to remain on this level of immunogenetics.
3. Linked to the above, the sequencing error rate was mentioned for PacBio, but I didn't see it discussed at all for the MinION data, which is the basis of the reference used. Considering the focus on allele polymorphism and the relatively low coverage, this has to be a major focus of the paper, quantified and evidenced.
More minor comments
4. Line 158 and Figure 2; to clarify the situation with IGH in ARS-UCD1.2, much of the IGH is incorrectly assembled to Chromosomes 20 (positions: 71881100-71974595) and 21 (positions: 1-411439), and some of it is unplaced (e.g., NKLS02001456.1). The text and figure should reflect this, as it is currently misleading.
5. Lines 177-8; While novel loci could be given provisional names in the manuscript, renaming gene segments that already have established IMGT names will create confusion. It would be more useful to retain the established IMGT nomenclature than to re-invent it. IMGT nomenclature is pseudo-positional - it is not strictly required to maintain positional numbering in order to account for gene content variation. For example: PMID: 24934119
6. Lines 182-3 (example); there are several places in the text that suggest that the newly described assembly "corrects" the reference assembly. Care should be taken when making these statements, since the two genomes are derived from different individuals. It is quite possible (and perhaps expected) that haplotypic variation in gene content exists in these repetitive immune-related gene complexes. Similarly, the new assembly is not simply a new "version" (line 192), since it is not an update to the existing reference assembly (i.e., it is an entirely different individual!)
7. Line 273; Ref 32 describes the felid MHC. A more appropriate reference for the cattle MHC, and one which established the NC6-NC10 nomenclature, is: PMID: 34802191
8. Line 270; Ref 8 is for the human MHC. A more appropriate reference for cattle (MHC I) is: PMID: 24934119. Also, the lack of DP genes in cattle was already reported long ago (PMID: 2891610)
9. Line 448; "… validated by four irrelevant people." I think "independent" is maybe intended here instead of "irrelevant"?
10. Line 456; please include a citation for IPD-MHC (PMID: 2891610)

Point-by-Point Response:
Reviewer #1: The paper describes a multi-platform approach to achieving a more complete characterization of the bovine genome. The authors use the more comprehensive detailing of the immunogenetics loci as a key example of how the highly contiguous genome enhances our understanding of bovine biology.
The report represents a substantial body of work that undoubtedly has value; however, there are a number of issues that need to be resolved prior to re-consideration of the possible acceptance for publication. I should emphasize that my experience relates more to the immunogenetics than to the genome assembly portion of the work.
Response: We greatly appreciate the reviewer's encouraging comments and constructive suggestions on our manuscript. Following these suggestions, we carried out additional comprehensive analysis of our data and revised our manuscript accordingly.

Concerns:
1 - English language - this is the most minor of the points I am raising. In several places through the manuscript the language is 'awkward', which distracts from the enjoyment of the reading but generally does not lead to misrepresentation of the information presented. However, I do feel sorry for the 'irrelevant people' on line 448! It would be good, if re-submission is invited, to ensure that there is a robust editorial review of the text.
Response: We apologize for the language and grammar errors in our manuscript. As the reviewer suggested, we replaced "irrelevant" with "independent". In addition, we edited the text substantially to improve its clarity and readability.
2 - Scientific queries on the immunogenetics loci. There are a number of pieces of data that are interesting for not adhering to paradigms, which the authors don't comment on as being notable. Examples include Figure 5h - the paradigm would be that TRB chains that utilize the DJ2 gene segment should recombine with J genes of the J2 cluster only - however, in this figure there appears to be recombination of DJ2 with both J1 and J3 cluster J genes as well - could the authors comment on this? Similarly, in Figure 6f the authors' data suggest the presence of 2 x DRB3 loci, 2 x DQB loci but only 1 x DQA locus. This is an intriguing observation, as it is generally held that although there may be multiple DRB loci (as implied by there being a DRB3) there should only be a single DRB3 locus. Also - it is generally inferred that the DQA and DQB loci, when duplicated, are usually duplicated together - so in general you would expect either haplotype in the animal selected may genuinely have this genotypic structure (the amount of analysed data in this research area is limited), but it is of some concern that these unusual features didn't attract comment from the authors.
Response: We thank the reviewer for the important suggestions. Following these suggestions, we carried out additional analysis and revised our manuscript accordingly.
(1) In the TRB locus, the V gene segments are followed by
The reviewer pointed out that the DJ2 gene segment should recombine with J genes of the J2 cluster only, whereas our original manuscript appeared to show recombination of DJ2 with both J1 and J3 genes (Figure 5h).
After a detailed review of the immune repertoire data analysis process, we found that the problem arose from the high sequence similarity of the three TRBD gene segments (Figure R1A). As a result, the D segment of a TRB transcript cannot always be inferred precisely (an example is shown in Figure R1B). In our original analysis, the D segment with the highest alignment score was selected whenever multiple D segments were aligned. Following the reviewer's suggestion, we analyzed the alignment results for the TRBD segments and found that only 56.1% of TRB transcripts were assigned one unique TRBD segment, 25.5% could be aligned to multiple TRBD segments, and 18.4% could not be aligned to any TRBD segment, which can be attributed to the V(D)J rearrangement and somatic mutation processes (Figure R1C). D segment inference from IGH and TRD transcripts showed similar results (Figure R1C). In contrast, the J segment showed very high sequence specificity (Figure R1D). Next, we re-analyzed the D-J pairs in the TRB transcripts that aligned uniquely to one TRBD gene (Figure R1C), and found that TRBD-TRBJ segments can only be paired in sequential order. Moreover, we noticed that TRBV30 is located downstream of DJC

(2) Regarding the MHC locus, the reviewer pointed out that there should only be a single DRB3 locus, whereas MHC haplotype 1 consisted of two tandem DRB3 loci in our original manuscript (Figure 6f). To fully address this point, we set up a pipeline that integrates MHC gene sequence mapping with validation by full-length gene transcripts (Figure R2A). Briefly, MHC II gene sequences were aligned to the genome assembly and the MHC haplotypes to give all possible MHC II gene loci (step 1). Then, PacBio full-length transcripts were aligned to the MHC II genes to give all possible transcripts for each gene (step 2). Next, the potential MHC II gene transcripts obtained in step 2 were aligned to the MHC haplotypes, and a transcript was treated as credible and kept only if it was fully aligned to a gene locus obtained in step 1 with a minimum overall sequence identity of 95% (step 3). Finally, all potential MHC gene loci (obtained in step 1) were checked manually; an MHC gene locus was treated as genuine if it was covered by multiple full-length transcripts (obtained in step 3), and its gene structure was determined by integrating the alignment information from the full-length transcripts (step 4).
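The filtering logic of steps 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Alignment` record, the function names, the reading of "fully aligned" as an aligned fraction of 1.0, and the reading of "multiple" as at least two transcripts are all hypothetical; only the 95% identity threshold comes from the text. The manual inspection in step 4 is not modeled.

```python
from collections import Counter
from dataclasses import dataclass

MIN_IDENTITY = 0.95              # threshold stated in step 3
MIN_ALIGNED_FRACTION = 1.0       # assumption: "fully aligned" means the whole transcript
MIN_SUPPORTING_TRANSCRIPTS = 2   # assumption: "multiple" means at least two


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record of one transcript aligned to one candidate locus."""
    transcript_id: str
    locus: str               # candidate gene locus from step 1
    aligned_fraction: float  # fraction of the transcript length that aligned
    identity: float          # overall sequence identity, 0 to 1


def credible_alignments(alignments):
    """Step 3: keep only full-length alignments at or above the identity cutoff."""
    return [
        a for a in alignments
        if a.aligned_fraction >= MIN_ALIGNED_FRACTION and a.identity >= MIN_IDENTITY
    ]


def supported_loci(alignments):
    """Step 4 (automatic part): loci covered by multiple credible transcripts."""
    support = Counter()
    seen = set()
    for a in credible_alignments(alignments):
        key = (a.locus, a.transcript_id)
        if key not in seen:  # count each transcript once per locus
            seen.add(key)
            support[a.locus] += 1
    return {locus for locus, n in support.items() if n >= MIN_SUPPORTING_TRANSCRIPTS}
```

Under these assumptions, a candidate locus that is matched only by partial or low-identity alignments is never reported as supported, which is the kind of evidence a reviewer would expect to see for a second DRB3 locus.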
Following this pipeline (Figure R2A), the first DRB3 locus was validated by a large number of full-length transcripts, while the second DRB3 locus was determined to be invalid, as only part of the DRB3 gene sequence were

3 - Literature references - in general the use of literature was not appropriate. A number of groups have published data on bovine MHC, Ig and TCR repertoires and diversity in cattle - virtually none of it was cited. For example, there have been multiple papers citing the large expansion of bovine TCR repertoires (and so the observations about this in the paper aren't really that novel; it should also be noted that from an immunogenetics perspective the data presented in Figure 5 don't really describe the Ig and TCR repertoires - only the V(D)J permutations, which are largely a predictable product of the number of V(D)J genes available for recombination - the text in the manuscript is inconsistent in how it refers to this). A better appreciation of the literature would have helped construct a better narrative. Similarly, to claim characterization of the complete immunogenomic repertoire, the authors may also need to consider including in the analysis other highly polymorphic loci that have been difficult to authoritatively characterise genomically, such as the NK receptor loci (LRC and NK gene complex). From my perspective, the issues here and in 2 above are those that are most critical to address.
Response: We appreciate the reviewer's valuable comments and suggestions.
(2) The reviewer pointed out that "the data presented in Figure 5 don't really describe the Ig and TCR repertoires -only the V(D)J permutations which is largely a predictable product of the number of V(D)J genes available for recombination". We apologize for the unclear description of the 'Full length immune repertoires' in our original manuscript. We isolated PBMCs from cattle and performed PacBio HiFi full-length transcriptome sequencing. With these data, we profiled the diversity of the full-length IG/TR transcripts and thereby further verified the functionality of the different V/D/J genes annotated in our assembly. Following the reviewer's suggestion, we changed the subtitle of the corresponding part of the Results from "Full-length immune repertoires profiling" to "Full-length IG/TR transcriptome profiling".
4 - Details in methodology - in some areas the description of the methodology is insufficient. For example, the description of the 'Full length immune repertoires' had some components that need further explanation - why was cDNA derived from 4 animals? Was this cDNA pooled prior to sequencing or after (and so could subsequently be de-multiplexed)? I don't quite understand what has happened and why.
I think it should be possible to improve on the descriptions with minor effort.
Response: Regarding the 'Full length immune repertoires', we changed the subtitle of the corresponding part of the Results from "Full-length immune repertoires profiling" to "Full-length IG/TR transcriptome profiling". We isolated PBMCs from four Holstein cows and performed the full-length RNA-seq experiments on each animal individually. The four samples from different animals serve as biological replicates. By performing PacBio HiFi full-length transcriptome sequencing, we profiled the diversity of the full-length IG/TR transcripts and thereby further verified the functionality of the different V/D/J genes annotated in our assembly.
We apologize for the insufficient description of our methodology. Following the reviewer's suggestions, we revised the Methods section and improved its clarity by adding experimental details (lines 508-517 in our revised manuscript).

Figure R1. Systematic analysis of D-J recombination of TRB transcripts. (A) Multiple sequence alignment of the three TRBD gene segments. (B) An example of a PacBio full-length TRB transcript, with the TRBV, D and J gene segments labeled in different colors. Determining the TRBD gene from the sequence alignment is challenging. (C-D) Statistics of D gene segment usage (C) and J gene segment usage (D) from full-length transcripts. (E) The distribution of unique pairwise combinations of D-J gene segments in TRB transcripts. (F) An example of a PacBio full-length TRB transcript rearranged with the TRBV30-TRBJ2-TRBC2 gene segments.
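
The D segment tallies reported in the response to point 2 (transcripts with a unique, multiple, or no TRBD assignment) can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names, the input layout (transcript ID mapped to the list of D segments it aligned to), and the segment labels TRBD1 to TRBD3 are assumptions.

```python
from collections import Counter


def classify_d_assignment(d_hits):
    """Classify one transcript by how many distinct D segments it aligned to."""
    distinct = set(d_hits)
    if not distinct:
        return "none"      # no D segment could be aligned
    if len(distinct) == 1:
        return "unique"    # exactly one D segment, usable for D-J pairing
    return "multiple"      # ambiguous: several similar D segments aligned


def summarize_d_assignments(hits_by_transcript):
    """Return the fraction of transcripts in each assignment class."""
    counts = Counter(classify_d_assignment(h) for h in hits_by_transcript.values())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts.get(label, 0) / total
            for label in ("unique", "multiple", "none")}


# Hypothetical input: transcript ID -> aligned D segment names
example_hits = {
    "tx1": ["TRBD1"],
    "tx2": ["TRBD1", "TRBD2"],
    "tx3": [],
    "tx4": ["TRBD3"],
}
print(summarize_d_assignments(example_hits))
```

Restricting the D-J pairing analysis to the "unique" class, as the response describes, avoids picking an arbitrary D segment when several align equally plausibly.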

Figure R2. MHC gene loci validation and full-length transcript alignment. (A) Analysis pipeline for validating MHC gene loci and structures in the two haplotypes. (B) DRB3 gene locus validation with full-length transcripts following the criteria in (A). From top to bottom, the three rows show the graphic track of the DRB3 gene model, the read coverage of full-length DRB3 transcripts, and the detailed read alignments. (C) The complete MHC class II gene loci and structures in both haplotypes. The assembled contigs of the haplotypes are labeled in deep blue. The validation of each gene by PacBio full-length transcript mapping is shown at the bottom.
(3) Following the reviewer's suggestions, we depicted the genomic structures of the NK receptor loci (LRC and NK gene complex), which are highly polymorphic (Figure R3). Both loci were seamlessly assembled in NCBA1.0. For the NK gene complex, the global gene structures are highly similar between NCBA1.0 and ARS-UCD1.2, except for a notable difference in the number of KLRC1 genes and nearby homologs, as illustrated in Figure R3A. NCBA1.0 contains one KLRC1 gene and five nearby highly similar genes (KLRC1-[2-6]), while ARS-UCD1.2 contains one KLRC1 gene and two KLRC1-like genes (LOC100847738 and LOC100336869). For the LRC gene complex, NCBA1.0 contains 17 KIR genes in total, including one KIR2DL gene (2DL5A), two KIR2DS genes, eight KIR3DS genes and six KIR3DL genes (Figure R3B), which is significantly more than in ARS-UCD1.2. We included these data as Supplementary Figure 16 in our revised manuscript and edited the text accordingly (lines 315-329).

Figure R3. Global genetic maps of NK receptor loci. (A) Genomic organization and detailed gene annotation of the NKC loci in ARS-UCD1.2 and NCBA1.0. (B) Genomic organization and detailed gene annotation of the LRC loci in ARS-UCD1.2 and NCBA1.0. ONT ultra-long reads longer than 100 kb and their mapping to the genomic region are drawn. Gene loci that differ between the two genomes are labeled with orange rectangles.
(1) We apologize for the inadequate referencing of the bovine immunogenetics literature. Following the reviewer's suggestions, we carefully reviewed recent publications on MHC diversity and IG/TR repertoires in cattle, and incorporated the following important references in our revised manuscript: the MHC I and II diversity of cattle populations has been analyzed with high-throughput sequencing methods (PMID: 36680506, Silwamba, Vasoya et al. 2023; PMID: 34102036, Vasoya, Oliveira et al. 2021), and there are genetic diversity studies of specific genes, such as BoLA-DRB3 (PMID: 33094557, Giovambattista, Takeshima et al. 2020; PMID: 32867670, Giovambattista, Moe et al. 2020). For the IPD-MHC database, we added reference